Goto

Collaborating Authors

 Incheon








Balanced Online Class-Incremental Learning via Dual Classifiers

Wen, Shunjie, Heinis, Thomas, Choi, Dong-Wan

arXiv.org Artificial Intelligence

Online class-incremental learning (OCIL) focuses on gradually learning new classes (called plasticity) from a stream of data in a single-pass, while concurrently preserving knowledge of previously learned classes (called stability). The primary challenge in OCIL lies in maintaining a good balance between the knowledge of old and new classes within the continually updated model. Most existing methods rely on explicit knowledge interaction through experience replay, and often employ exclusive training separation to address bias problems. Nevertheless, it still remains a big challenge to achieve a well-balanced learner, as these methods often exhibit either reduced plasticity or limited stability due to difficulties in continually integrating knowledge in the OCIL setting. In this paper, we propose a novel replay-based method, called Balanced Inclusive Separation for Online iNcremental learning (BISON), which can achieve both high plasticity and stability, thus ensuring more balanced performance in OCIL. Our BISON method proposes an inclusive training separation strategy using dual classifiers so that knowledge from both old and new classes can effectively be integrated into the model, while introducing implicit approaches for transferring knowledge across the two classifiers. Extensive experimental evaluations over three widely-used OCIL benchmark datasets demonstrate the superiority of BISON, showing more balanced yet better performance compared to state-of-the-art replay-based OCIL methods.


ChronusOmni: Improving Time Awareness of Omni Large Language Models

Chen, Yijing, Wu, Yihan, Guan, Kaisi, Ren, Yuchen, Wang, Yuyue, Song, Ruihua, Ru, Liyun

arXiv.org Artificial Intelligence

Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at aspecific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support the training and evaluation on audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics upon other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.


Probabilistic Multi-Agent Aircraft Landing Time Prediction

Kim, Kyungmin, Yoon, Seokbin, Lee, Keumjin

arXiv.org Artificial Intelligence

Accurate and reliable aircraft landing time prediction is essential for effective resource allocation in air traffic management. However, the inherent uncertainty of aircraft trajectories and traffic flows poses significant challenges to both prediction accuracy and trustworthiness. Therefore, prediction models should not only provide point estimates of aircraft landing times but also the uncertainties associated with these predictions. Furthermore, aircraft trajectories are frequently influenced by the presence of nearby aircraft through air traffic control interventions such as radar vectoring. Consequently, landing time prediction models must account for multi-agent interactions in the airspace. In this work, we propose a probabilistic multi-agent aircraft landing time prediction framework that provides the landing times of multiple aircraft as distributions. We evaluate the proposed framework using an air traffic surveillance dataset collected from the terminal airspace of the Incheon International Airport in South Korea. The results demonstrate that the proposed model achieves higher prediction accuracy than the baselines and quantifies the associated uncertainties of its outcomes. In addition, the model uncovered underlying patterns in air traffic control through its attention scores, thereby enhancing explainability.


Contract-Driven QoE Auditing for Speech and Singing Services: From MOS Regression to Service Graphs

Du, Wenzhang

arXiv.org Artificial Intelligence

Subjective mean opinion scores (MOS) remain the de-facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs. We propose a contract-driven QoE auditing framework: each service graph G is evaluated under a set of human-interpretable experience contracts C, yielding a contract-level satisfaction vector Q(G, C). We show that (i) classical MOS regression is a special case with a degenerate contract set, (ii) contract-driven quality is more stable than MOS under graph view transformations (e.g., pooling by system vs. by system type), and (iii) the effective sample complexity of learning contracts is governed by contract semantics rather than merely the dimensionality of C. We instantiate the framework on URGENT2024 MOS (6.9k speech utterances with raw rating vectors) and SingMOS v1 (7,981 singing clips; 80 systems). On URGENT, we train a contract-aware neural auditor on self-supervised WavLM embeddings; on SingMOS, we perform contract-driven graph auditing using released rating vectors and metadata without decoding audio. Empirically, our auditor matches strong MOS predictors in MOS accuracy while providing calibrated contract probabilities; on SingMOS, Q(G, C) exhibits substantially smaller cross-view drift than raw MOS and graph-only baselines; on URGENT, difficulty curves reveal that mis-specified "simple" contracts can be harder to learn than richer but better aligned contract sets.